OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

General Chinese document capture system with improved error-rejecting module

Internal identifier: 001792 (Main/Exploration); previous: 001791; next: 001793

Authors: DAHAI LUAN [People's Republic of China]; CHANGSONG LIU [People's Republic of China]; XIAOQING DING [People's Republic of China]

Source: SPIE proceedings series (ISSN 1017-2653), 2003

RBID : Pascal:03-0420442

Abstract

This paper introduces a newly designed general-purpose Chinese document data capture system, Tsinghua OCR (Optical Character Recognition) Network Edition (TONE). The system aims to cut the high cost of digitizing large volumes of Chinese paper documents. Our first step was to divide the whole data-entry process into a few single-purpose procedures. Based on these procedures, a production-line-like system configuration was then developed. By design, the management cost is reduced directly by substituting automated task scheduling for traditional manual assignment, and indirectly by adopting a well-designed quality control mechanism. Classification distances, character image positions, and context grammars are combined to reject questionable characters. Experiments showed that when 19.91% of the characters are rejected, the residual error rate can fall to 0.0097% (below one per ten thousand characters). This improvement makes the error-rejecting module practical to use. According to the cost distribution in the data companies (specifically, manual correction accounts for 70% of the total), the estimated total cost reduction could be over 50%.
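
The figures quoted in the abstract can be turned into a rough workload estimate. The sketch below is only an illustration: it assumes that the residual error rate is measured over all characters (the abstract does not say whether the denominator is all characters or only the accepted ones), and that manual correction cost scales linearly with the share of characters a person has to inspect. The function names `workload` and `cost_reduction` are hypothetical, and the paper may derive its "over 50%" estimate differently.

```python
# Figures quoted in the abstract.
REJECT_RATE = 0.1991                 # share of characters rejected for manual review
RESIDUAL_ERROR_RATE = 0.0097 / 100   # 0.0097% expressed as a fraction
MANUAL_SHARE = 0.70                  # manual correction as a share of total cost


def workload(n_chars):
    """Return (to_review, auto_accepted, expected_residual_errors).

    Assumption (not stated in the abstract): the residual error rate
    applies to the whole batch of n_chars characters.
    """
    to_review = n_chars * REJECT_RATE
    auto_accepted = n_chars - to_review
    expected_residual_errors = n_chars * RESIDUAL_ERROR_RATE
    return to_review, auto_accepted, expected_residual_errors


def cost_reduction(manual_share=MANUAL_SHARE, reject_rate=REJECT_RATE):
    """Estimated fraction of total cost saved.

    Assumption (a simple linear model, not taken from the paper):
    manual cost is proportional to the share of characters reviewed,
    so reviewing only the rejected share saves the remainder.
    """
    return manual_share * (1.0 - reject_rate)


if __name__ == "__main__":
    review, accepted, errors = workload(1_000_000)
    print(f"characters sent to manual review: {review:.0f}")
    print(f"characters accepted automatically: {accepted:.0f}")
    print(f"expected residual errors: {errors:.1f}")
    print(f"estimated total cost reduction: {cost_reduction():.2%}")
```

Under these assumptions the saving is 0.70 × (1 − 0.1991), which is consistent with the abstract's claim of a reduction of over 50%.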



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">General Chinese document capture system with improved error-rejecting module</title>
<author>
<name sortKey="Dahai Luan" sort="Dahai Luan" uniqKey="Dahai Luan" last="Dahai Luan">DAHAI LUAN</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>State Key Laboratory of Intelligent Technology and Systems, Dept. of Electronic Engineering, Tsinghua University</s1>
<s2>Beijing 10084</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Changsong Liu" sort="Changsong Liu" uniqKey="Changsong Liu" last="Changsong Liu">CHANGSONG LIU</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>State Key Laboratory of Intelligent Technology and Systems, Dept. of Electronic Engineering, Tsinghua University</s1>
<s2>Beijing 10084</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Xiaoqing Ding" sort="Xiaoqing Ding" uniqKey="Xiaoqing Ding" last="Xiaoqing Ding">XIAOQING DING</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>State Key Laboratory of Intelligent Technology and Systems, Dept. of Electronic Engineering, Tsinghua University</s1>
<s2>Beijing 10084</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">03-0420442</idno>
<date when="2003">2003</date>
<idno type="stanalyst">PASCAL 03-0420442 INIST</idno>
<idno type="RBID">Pascal:03-0420442</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000602</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000189</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000556</idno>
<idno type="wicri:doubleKey">1017-2653:2003:Dahai Luan:general:chinese:document</idno>
<idno type="wicri:Area/Main/Merge">001870</idno>
<idno type="wicri:Area/Main/Curation">001792</idno>
<idno type="wicri:Area/Main/Exploration">001792</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">General Chinese document capture system with improved error-rejecting module</title>
<author>
<name sortKey="Dahai Luan" sort="Dahai Luan" uniqKey="Dahai Luan" last="Dahai Luan">DAHAI LUAN</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>State Key Laboratory of Intelligent Technology and Systems, Dept. of Electronic Engineering, Tsinghua University</s1>
<s2>Beijing 10084</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Changsong Liu" sort="Changsong Liu" uniqKey="Changsong Liu" last="Changsong Liu">CHANGSONG LIU</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>State Key Laboratory of Intelligent Technology and Systems, Dept. of Electronic Engineering, Tsinghua University</s1>
<s2>Beijing 10084</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Xiaoqing Ding" sort="Xiaoqing Ding" uniqKey="Xiaoqing Ding" last="Xiaoqing Ding">XIAOQING DING</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>State Key Laboratory of Intelligent Technology and Systems, Dept. of Electronic Engineering, Tsinghua University</s1>
<s2>Beijing 10084</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
<imprint>
<date when="2003">2003</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Chinese</term>
<term>Data acquisition</term>
<term>Document processing</term>
<term>Error detection</term>
<term>Language</term>
<term>Optical character recognition</term>
<term>System architecture</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Reconnaissance optique caractère</term>
<term>Traitement document</term>
<term>Saisie donnée</term>
<term>Détection erreur</term>
<term>Architecture système</term>
<term>Chinois</term>
<term>Langage</term>
<term>TONE (Tsinghua OCR Network Edition)</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Langage</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This paper introduces a newly designed general-purpose Chinese document data capture system, Tsinghua OCR (Optical Character Recognition) Network Edition (TONE). The system aims to cut the high cost of digitizing large volumes of Chinese paper documents. Our first step was to divide the whole data-entry process into a few single-purpose procedures. Based on these procedures, a production-line-like system configuration was then developed. By design, the management cost is reduced directly by substituting automated task scheduling for traditional manual assignment, and indirectly by adopting a well-designed quality control mechanism. Classification distances, character image positions, and context grammars are combined to reject questionable characters. Experiments showed that when 19.91% of the characters are rejected, the residual error rate can fall to 0.0097% (below one per ten thousand characters). This improvement makes the error-rejecting module practical to use. According to the cost distribution in the data companies (specifically, manual correction accounts for 70% of the total), the estimated total cost reduction could be over 50%.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>République populaire de Chine</li>
</country>
<settlement>
<li>Pékin</li>
</settlement>
</list>
<tree>
<country name="République populaire de Chine">
<noRegion>
<name sortKey="Dahai Luan" sort="Dahai Luan" uniqKey="Dahai Luan" last="Dahai Luan">DAHAI LUAN</name>
</noRegion>
<name sortKey="Changsong Liu" sort="Changsong Liu" uniqKey="Changsong Liu" last="Changsong Liu">CHANGSONG LIU</name>
<name sortKey="Xiaoqing Ding" sort="Xiaoqing Ding" uniqKey="Xiaoqing Ding" last="Xiaoqing Ding">XIAOQING DING</name>
</country>
</tree>
</affiliations>
</record>
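
As a rough illustration of how such a record might be read programmatically, the sketch below extracts the title, authors, RBID and English keywords with Python's standard `xml.etree.ElementTree` module. It works on a trimmed copy of the record: the elements and attributes using the `wicri:` and `inist:` prefixes are left out, because those prefixes are not declared in the record as shown and a namespace-aware parser would reject them as unbound. The function name `summarize` and the trimmed `RECORD` string are assumptions made for this example, not part of Dilib.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record above. Elements and attributes with the
# undeclared wicri: and inist: prefixes are omitted on purpose.
RECORD = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">General Chinese document capture system with improved error-rejecting module</title>
<author><name>DAHAI LUAN</name></author>
<author><name>CHANGSONG LIU</name></author>
<author><name>XIAOQING DING</name></author>
</titleStmt>
<publicationStmt>
<idno type="inist">03-0420442</idno>
<idno type="RBID">Pascal:03-0420442</idno>
</publicationStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Chinese</term>
<term>Error detection</term>
<term>Optical character recognition</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
</TEI>
</record>"""


def summarize(xml_text):
    """Collect a few bibliographic fields from a TEI-like record."""
    root = ET.fromstring(xml_text)
    stmt = root.find(".//titleStmt")
    return {
        "title": stmt.findtext("title"),
        "authors": [name.text for name in stmt.findall("author/name")],
        "rbid": root.findtext(".//idno[@type='RBID']"),
        "keywords": [
            term.text
            for term in root.findall(".//keywords[@scheme='KwdEn']/term")
        ],
    }


if __name__ == "__main__":
    for field, value in summarize(RECORD).items():
        print(f"{field}: {value}")
```

To process the full record instead of the trimmed copy, one option would be to wrap it in an element that declares the two prefixes; the namespace URIs to use would have to come from the Wicri or Dilib documentation and are not guessed here.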

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001792 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001792 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:03-0420442
   |texte=   General Chinese document capture system with improved error-rejecting module
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024